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Abstract 

Background: The microsporidia parasite Nosemo contributes to the steep global decline of honey bees that are 
critical pollinators of food crops. There are two species of Nosemo that have been found to infect honey bees, 
Nosema apis and N. ceranae. Genome sequencing of N. apis and comparative genome analysis with N. ceranae, a 
fully sequenced microsporidia species, reveal novel insights into host-parasite interactions underlying the parasite 
infections. 

Results: We applied the whole-genome shotgun sequencing approach to sequence and assemble the genome of 
N. apis which has an estimated size of 8.5 Mbp. We predicted 2,771 protein- coding genes and predicted the 
function of each putative protein using the Gene Ontology. The comparative genomic analysis led to identification 
of 1,356 orthologs that are conserved between the two Nosema species and genes that are unique characteristics 
of the individual species, thereby providing a list of virulence factors and new genetic tools for studying host- 
parasite interactions. We also identified a highly abundant motif in the upstream promoter regions of N. apis genes. 
This motif is also conserved in N. ceranae and other microsporidia species and likely plays a role in gene regulation 
across the microsporidia. 

Conclusions: The availability of the N. apis genome sequence is a significant addition to the rapidly expanding 
body of microsprodian genomic data which has been improving our understanding of eukaryotic genome diversity 
and evolution in a broad sense. The predicted virulent genes and transcriptional regulatory elements are potential 
targets for innovative therapeutics to break down the life cycle of the parasite. 

Keywords: Microsporidia, Nosema, Honey bees, Genome, Comparative genomics 



Background 

Microsporidia are obligate intracellular parasites that 
comprise over 1,200 described species belonging to 143 
genera and infect members of almost all animal phyla 
[1-3]. Until recently, microsporidia were regarded as 
primitive amitochondriate protozoa, because they do not 
possess typical eukaryotic organelles such as mitochon- 
dria, peroxisomes, and classical stacked Golgi apparati. 
In addition, ribosomal RNAs of microsporidia present 
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some prokaryote-like features by harboring 16S and 23S 
rRNA genes and lacking the 5.8S rRNA in their large 
ribosomal subunit that is characteristic of eukaryotes [4]. 
However, this putatively ancient origin of microsporidia 
has been contested and redefined. Recent data has 
shown the presence of a mitosome, a mitochondrion- 
derived organelle [5], and genes encoding proteins of 
mitochondrial origin such as heat shock protein 70 in 
microsporidia [6,7]. This suggests that microsporidia 
once had mitochondria and lost them through the 
course of evolution. Based on the aforementioned fac- 
tors, in conjunction with their possession of characteris- 
tic traits of closed mitosis and spores that contain chitin 
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and trehalose, as fungi do [8], microsporidia have been 
classified as the earliest diverging clade of sequenced 
fungi [9] that underwent substantial genetic changes 
during adaptive evolution. 

Nosema apis and N ceranae (Family: Nosematidae, 
Genus: Nosema) are two microsporidia species that 
infect honey bees. Like other microsporidia, the infective 
form of the Nosema is the resistant spore, which is 
surrounded by an outer layer consisting of an electron- 
dense glycoprotein exospore and an electron-lucent chi- 
tinous endospore layer separated from the cell by a thin 
plasma membrane. The spore possesses a polar tubule 
which is considered to be a specialized product of the 
Golgi apparatus and the posterior vacuole at the posterior 
pole of the spore has been postulated to function as a 
peroxisome [10-12]. In the midgut of the host, the spore 
germinates and extrudes its polar tubule and injects the 
infective sporoplasm into the host intestinal epithelium. 
Inside the host cell cytoplasm, the sporoplasm undergoes 
divisions by binary fission (merogony) or multiple fission 
(schizogony) and spore production by formation of a thick 
wall around the spore (sporogony). Repeated multiplica- 
tion results in the host cell becoming completely filled 
with spores and eventually rupturing to release the spores 
into surroundings cells [13] and to spread to other tissues 
[14,15]. Mature spores excreted in feces can also infect 
other hosts, providing new sources of the infection in bee 
colonies [16]. Parasitism by Nosema impairs the hosts 
metabolism and reproduction, and causes a destructive 
disease known as nosemosis, a digestive disorder that 
shortens honey bee lifespan, retards colony development, 
decimates bee populations, and causes colony queen 
supersedure [17-22]. N. ceranae as a destructive intracellu- 
lar parasite not only directly causes a serious disease in 
honey bees but also has the potential to compromise the 
physical and immunological barriers of honey bees toward 
disease, leaving bees more susceptible to other pathogens 
and senescence [23-25]. New evidence shows that interac- 
tions between N ceranae and neuro-active pesticides, 
neonicotinoids which have been used for controlling pests 
in the IPM system, could synergistically and negatively 
affect honey bee survival and significantly contribute to 
colony depopulation [26-30]. So far, the only registered 
treatment for Nosema disease is fumagillin, whose use is 
forbidden in Europe because it has no established 
Maximun Residue Level (MRL). With prolonged use of 
fumagillin, disease resistance to treatment has become an 
issue [31]. As a result, additional therapeutic options are 
urgently needed. 

European honey bees, Apis mellifera, are the most 
important insect pollinator and are responsible for the 
pollination of one third of agricultural food crops in the 
world with an estimated value of $216/€153 billion per 
year [32]. For decades, nosema disease was exclusively 



attributed to N apis, first described in 1909 [33]. In 
2005, a natural infection of another microsporidia spe- 
cies, N ceranae, which was first found in the Asian 
honey bee Apis cerana [34], was identified in A. mellifera 
colonies [35,36]. Subsequently, N ceranae has become 
the predominant infection of two bee species [37-41] 
and has been implicated in honey bee Colony Collapse 
Disorder (CCD) [15,19,42-44], a malady that has deci- 
mated honey bee colonies across the U.S. and around 
the world [45-47]. The emergence of N ceranae has 
changed the epidemiological pattern of honey bee 
nosemosis, resulting in an urgent need for elucidating 
the genetic basis that defines the epidemiology and 
pathogenicity of the Nosema species. 

Microsporidia genomes are highly reduced and compact 
in nature. The size of microsporidia genomes has been de- 
termined for sixteen microsporidia species mostly based 
on karyotyping analysis and EST surveys and varies con- 
siderably, ranging from 2.3 to 23 Mbp (reviewed in 
[48,49]). The first full genome sequence of a microsporidia 
species, Encephalitozoon cuniculi, marked a significant 
milestone in the study of microsporidia biology [50]. The 
genome organization and gene content of E. cuniculi rep- 
resents the first evidence of microsporidia genomic reduc- 
tion (2.9 Mb) and has provided considerable functional 
insight on the evolution of microsporidia-host relation- 
ships at the biochemical and molecular levels. Since 
then, seven additional microsporidia genomes have 
been sequenced [51-57]. 

In 2009, using the whole-genome shotgun (WGS) 
sequencing approach, we sequenced, annotated, and 
analyzed the complete genome of N ceranae at 25X 
coverage [52], which represents a second full genome 
sequence of microsporidia. Like many microsporidia, N 
ceranae has a highly reduced and compact genome with 
a draft assembly of 7.86 Mbp. A total of 2,641 putative 
protein-coding genes were identified based on compu- 
tational analysis of the N ceranae genome sequences. 
By comparing the N ceranae putative protein-coding 
genes with those of fully sequenced E. cuniculi as well 
as with yeast, Saccharomyces cerevisiae, about 50% of 
N ceranae genes appeared to be conserved among 
microsporidia. These microsporidia specific genes are 
of special interest for identification of potential viru- 
lence factors and elucidation of molecular mechanisms 
of intracellular invasion of these obligate intracellular 
parasites. We have now extended our efforts for 
genome-wide analysis of insect microsporidia to a second 
species, N apis which is the first described microsporidia 
species of honey bees. In this report, we present a draft 
sequence of the N apis genome and an annotation of the 
N apis genome at the nucleotide and protein levels. We 
undertake comparative analyses of N apis with its sympat- 
ric congener N ceranae in order to identify candidates of 
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species-specific coding sequences that may be associated 
with adaptation and virulence of the intracellular parasites. 
We hope that our genomic studies of honey bee 
microsporidia parasites will lead to greater understanding 
of the biology and pathogenesis of these difficult-to-treat 
fungal pathogens and contribute to the worldwide efforts 
to manage honey bee diseases and improve pollination 
services provided by honey bees. 

Results and discussion 

Species specificity of A/. Apis 

Light microscopy revealed that fresh N. apis spores were 
oval shaped and varied in size, with a minimum of 
5.15 um and maximum of 7.415 um in length, and a 
minimum of 2.99 um and maximum of 4.836 um in 
width (average 6.41 ± 0.74 x 3.36 ± 0.39 um, N = 50) 
(Figure la). A single specific band was produced from 
the genomic DNA isolated from purified spores after 
PCR amplification with N. apis specific primers, and the 
sequence analysis confirmed the species specificity of the 
PCR fragment. No PCR product was generated when N. 
ceranae specific primers were added to the PCR mixture 
(Figure lb). 

General features of N. Apis genome assembly 

This Whole Genome Shotgun project has been depos- 
ited at DDBJ/EMBL/GenBank under the accession# 
ANPH00000000. The version described in this paper is 
the first version, ANPHO 1000000. The raw reads were 
submitted to the Sequence Read Archive (SRA) data- 
base with accession # SRA067788. There are 8,274,787 
nucleotides grouped in 1,138 contigs, with an N 50 
contig size of 14,029 bp, and a mean contig size of 
7,733 bp. These 1,138 contigs were placed into 554 scaf- 
folds linked through paired reads with an N 50 scaffold 
size of 24,309. The average number of contigs per 



scaffold was 2.05 with a minimum of 1 contig in a scaf- 
fold and maximum of 19 contigs in a scaffold. The scaf- 
folds contain 584 intra-scaffold, or sequence, gaps with 
an estimated average length of only 74 bp. This large 
number of short gaps is consistent with the limitations 
of sequencing high- AT content genomes. Of the 1,138 
contigs, 275 are larger than 10 Kbp and cover 63% of 
the genome. The average depth of coverage for non- 
repetitive contigs is 15X. Because of the surrogate- 
placement repeat handling of the Celera Assembler, the 
average coverage of all contigs, including repetitive 
regions of the genome, is 22X. Including both the total 
contig length and estimated gap lengths based on 
paired-reads, the genome of N. apis is estimated to be 
8.5 megabases (Mb). Sequencing and assembly statistics 
are summarized in Table 1. 

Contents of the N. Apis genome 

The main features of the N. apis genome in comparison 
with N. ceranae are summarized in Table 2. The com- 
parison of the size and contents of genomes indicated 
that the overall genetic structure, contents, coding 
capacity, and proteome complexity are very similar, 
reflecting the shared evolutionary history of both spe- 
cies. The search of the N. apis genome sequence assem- 
bly for repetitive DNA showed that the simple repeats 
and low complexity cover about 9.92% of the total se- 
quence data. There were 7,789 tandem repeats identified 
with dinucleotide "AT" having the most abundant 
repeats dispersed in the N. apis genome. Three long 
terminal repeats (LTR) were identified. Three piggyback 
and three gypsy transposable elements were identified in 
the genome. 

The N. apis genome is very AT rich, with an overall 
GC content of 18.78%, 7.2% lower than that of N. 
ceranae. The protein coding regions have significantly 
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Figure 1 Morphological and molecular confirmation of N. apis species status, (a) Light micrograph of N. apis spore. Scale bar: 10 uM 

(b) Agarose gel electrophoresis of PCR products showing the presence of a single N. apis infection in the DNA sample. The presence of a single 

N. apis infection in the DNA sample intended for 454 pyrosequencing was confirmed by species-specific PCR and further sequencing analysis. 
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Table 1 N. apis genome assembly statistics 


Component 




Value 


Scaffold 








Number of Total Scaffold 


554 




Minimum Scaffold Span 


335 




Maximum Scaffold Span 


154351 




Mean Span of Scaffolds 


15447 




Scaffold N50 


24309 




Total Bases in Scaffold 


8.5 Mb 


Contig 








Number of Total Contigs 


1,138 




Minimum Contig Length 


97 




Maximum Contig Length 


75980 




Mean Contig Length 


7482 




Contig N50 


14,029 



higher GC content (25%) compared to the overall GC 
content for the entire genome, indicating that GC con- 
tent is not uniformly distributed across the genome and 
supporting the notion that GC content is higher in cod- 
ing regions than in non-coding regions. The GC content 
also varies by codon position, with the 1st and 2nd posi- 
tions of the coding regions having greater GC content 
than the third (synonymous) position. 

N. apis displays bias in its codon usage and has a 
pattern similar to that of N. ceranae. Both N. apis and 
N. ceranae have similarly preferred or optimal codons 
that have an A or T ending. However, codons ending 
with A or T are used more frequently in N. apis while 
codons ending with C or G are used more frequently in 
N. ceranae (Figure 2). The Maximum-likelihood codon 
bias (MCB) analysis showed that N. apis has a relatively 
lower level of correlation between GC content at the 
third position and the MCB (R 2 = 0.379, p < 0.05), com- 
pared to N. ceranae (R 2 = 0.425, p < 0.05) (Figure 3). 



Table 2 Main features of the N. apis genome in 
comparison with N. ceranae 





N. apis 


N. ceranae 


Size (Mb) 


8.5 


7.86 


GC percentage (total genome) (%) 


18.78 


26 


GC percentage in coding sequences (%) 


25 


27 


GC percentage in 3rd position 


13 


18 


tRNA genes (#) 


62 


65 


Protein coding genes (CDSs) 


2771 


2614 


Gene density (genes/kb) 


0.326 


0.33 


Percent coding (%) 


26.8 


24.2 



The merged gene sets from the different programs 
resulted in a mixture of 2771 predicted protein coding 
regions (CDSs) made up of 2,159 (78%) intact and 612 
(22%) intron interrupted protein-coding genes. The 
introns were inserted in random places within the genes 
and the genes containing introns fell into different func- 
tional groups. The consensus sequences "A(G/T)GT" at 
the donor site and "A(G/A)" at the acceptor site were 
identified in most of the intron-containing genes 
(Additional file 1: Figure SI). The density of genes across 
the genome is 0.33, which is similar to that for N. 
ceranae. However, no significant correlation between the 
length of scaffolds and the level of gene density was 
identified (Figure 4). Among the predicted CDSs, there 
are 2,684 complete genes and 87 partial genes that lack 
either a start or stop codon, or both. The length of CDSs 
averaged 274.88 amino acid residues per CDS. 

Of these 2,771 CDSs, 2,105 (76%) had matches to the 
nr protein database (Threshold E value < le-03). Most of 
the sequences have an e-value between 1.0e-3 and 1.0e-25 
(Additional file 2: Figure S2a). Among the 2,771 CDSs, 
545 (20%) were assigned as conserved hypothetical pro- 
teins, 978 (35%) were assigned with GO terms, 582 
(21%) were unannotated proteins and 665 (24%) had no 
match to the databases (Figure 5). The sequences with- 
out positive BLAST hits may be explained by a low 
similarity of the genes to those functionally similar 
genes in the database, or new genes that possess new 
functions in the parasite. The sequence similarity of N. 
apis within the database by Blast search ranges from 
35% to approx 98% and peaks at 52% (Additional file 2: 
Figure S2b). The sequences of N. apis genes showed the 
most significant similarity to the sequences of N. ceranae 
followed by sequences of E. intestinalis and E. cuniculi 
(Additional file 2: Figure S2c). The majority of functional 
prediction of the coding sequences are obtained from the 
UniProt Knowledge Base (KB), a nonredundant protein 
database that includes PDS, UniProt, Swiss-Prot, TreMBL, 
and TAIR (Additional file 3: Figure S3a, 3b, and 3c). The 
N. apis reference names, gene prediction sources, the gene 
locations, coding strand, and conceptual translations are 
shown in a Gene-Finding File (GFF) formatted dataset 
(Additional file 4: Table SI). Table SI also lists Blast-hit 
annotation, Enzyme code (EC), Inter Pro motifs, signal 
peptide and transmembrane motifs. 

A search for motifs in the 100-bp upstream regions of 
start codons of N. apis genes showed an enrichment of 
TATA box motifs in the upstream of genes and also 
yielded three distinct MEME motifs. Of the three over- 
represented motifs, the "ACCCTT" motif conserved in 
N. ceranae and E. cuniculi [52] was particularly prominent 
(E = 1.14e-242), and was present in 37% of the predicted 
genes. The motif shows a preference for occurring pre- 
dominantly at the -15 bp upstream position of the start 
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Figure 2 Codon usage frequencies of N. apis (Blue) and N. ceranae (Green) gene groups. Each bar represents the proportion of all codons 
encoding a given amino-acid that are the specified codon. The values are one by definition for the single-codon amino acids, tryptophan (Trp) 
and methionine (Met). Both N. apis and N. ceranae share similar optimal codons. 



codon. The "ACCCTT" sequence logo and representative 
upstream sequences of start codon of N. apis genes are 
shown in Figure 6. 

The predictions by SignalP and TMHMM showed that 
399 proteins contained putative signal peptides with over 
70% of the probability and significantly high C, S and Y 
scores and that 348 proteins contained putative trans- 
membrane segments of some form with a score of equal 
or more than 0.3. Of 348 proteins that had membrane- 
spanning domains, 114 were predicted to contain an 
amino-terminal signal peptide. The signal peptides, and 
transmembrane motifs are also included in the dataset 
(Additional file 4: Table SI). 

The combination of noncoding RNA (ncRNA) predic- 
tion programs and BLAST similarity search algorithms 
identified tRNA genes, rRNA gene clusters (16S, 23S and 
5S), and small nuclear RNA (snRNA) genes. Both 
ARAGORN and tRNAscan-SE yielded similar results for 
tRNA prediction and showed a diversity of tRNA genes in 
the N. apis genome. There were 62 tRNA genes from 
ARAGORN and 58 genes from tRNAscan-SE as well as 
16 tRNA-derived pseudogenes in tRNAscan-SE. By com- 
bining predictions from two tRNA prediction programs, 
there were 60 overlapping tRNAs with 42 distinct antico- 
dons and gene lengths of 57 to 68 nucleotides. The 16S, 
23S, and 5S rRNA genes were organized as a typical co- 
transcribed operon and arranged in order of 16S-23S-5S 
separated by two internal transcribed spacers (ITSs) in a 
tandemly repeated manner. The 16S, 23S, and 5S genes 



were identified in twenty-six scaffolds and four scaffolds 
contain a complete locus of 16S, 23S, and 5S genes. The 
sequence comparisons revealed polymorphisms among 
copies of rRNA genes and ITSs with the majority of poly- 
morphisms found in the IGS region. The degree of 
intragenomic variability observed in 23 S rRNA genes was 
greater than in 16S and 5S rRNA genes. While both 16S 
and 5S genes displayed 2% intragenomic variability, the 
average intragenomic variability observed it the 23S genes 
was about 10% with one 23S rRNA gene showing extreme 
divergence and sharing only 82% nucleotide sequence 
homology with the other 23S rRNA genes. While se- 
quence alignment identified short conserved sequence 
motifs in ITSs, the intragenomic variability observed in 
ITSs was demonstrated not only by nucleotide substitu- 
tions but also by length variance across the genome. 

The comparison of the 2,771 putative N. apis proteins 
with the 2,614 N. ceranae proteins revealed a broadly simi- 
lar gene set shared between N. apis and N. ceranae. The 
reciprocal best BLAST search comparing the N. apis gene 
set and N. ceranae gene set yielded 1,356 one-to-one 
orthologs that were each others mutual best matches; 
thus, the orthologs corresponded to 49% of N. apis 
genes and 52% of N. ceranae gene sets. Of 1,356 
ortholog pairs for both N. apis and N. ceranae, 137 
orthologs had multiple hits in N. apis, suggesting the 
presence of paralogous proteins. The list of ortholog 
pairs in the genome of N. apis and N. ceranae can been 
found in Additional file 5: Table S2. 
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Figure 3 Maximum-likelihood codon bias (MCB) analysis. The 

GC content at the third sites of N. opis and N. ceronoe genes was 
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Functional annotation of protein coding genes 

Among sequences that were assigned with gene ontology 
(GO) functional classification terms, 103 out of 978 se- 
quences were further annotated with Enzyme Commission 
(EC) codes. Annotated sequences were further mapped 
to 50 different biological pathways. The KEGG pathway 
map revealed the presence of a large number of N. apis 
sequences involved in nucleotide metabolism including 
purine metabolism and pyrimidine metabolism. Genes 
encoded all key enzymes involved in glycolysis, the 
pentose-phosphate pathway, and the trehalose pathway 
identified in N. ceranae were also found in N. apis, indi- 
cating that carbohydrate metabolism is the main 
process responsible for energy generation in Noserna 
(Figure 7). N. apis also contained enzymes participating 
in amino acid and lipid metabolism. It was not unex- 
pected that N. apis lacked genes encoding enzymes in- 
volved in electron transfer chains and the tricarboxylic 
acid (TCA) cycle, which corroborates previous findings 



in other microsporidia and indicate a strong host de- 
pendence for energy production. The KEGG pathway 
map also revealed that N. apis harbored genes for 
aminoacyl-tRNA biosynthesis, a process that is essential 
for accurate translation of genetic information, and 
genes required for biosynthesis of secondary metabo- 
lites and microbial metabolism that are involved in the 
parasites growth, survival, and fecundity. 

Of protein sequences with assigned GO terms, most 
sequences have between 1 and 6 GO terms annotated. 
Annotation results and distribution, GO-level distribution, 
number of GO-terms for N. apis sequences with length 
(x-axis), annotation score distribution; and percentage of 
N. apis sequences with length (x-axis) annotated are 
shown in Additional file 6: Figure S4a, 4b, 4c, 4d, 4e and 
4d. Of the proteins that had matches to the nr protein 
database, the most abundant protein class is the binding 
proteins. Other highly abundant proteins include intracel- 
lular transport proteins, ribosomal proteins, and proteins 
involved in major functional categories such as RNA pro- 
cessing, transcription and translation initiation and regula- 
tion, post-traslational modification and energy production. 
Among GO terms of molecular function, "Binding" was 
the most predominant group. Of GO terms identified for 
biological process, "metabolic", was the most dominant 
term with 18 subcategories related to the metabolic 
process. Of categories enriched in the cellular component, 
protein complex, intracellular membrane-bounded and 
non-membrane bounded organelles were the largest 
groups (Figure 8 A, 8B, and 8C). 

Functional analysis showed that N. apis harbored a 
highly reduced mitochondrial homologue, the mitosome, 
and encoded a number of putative proteins that are 
candidate constituents of this mitosome. In total, 
seventeen genes encoding mitosomal proteins includ- 
ing ATP-binding cassette (ABC) transporter complex, 
sulfate transporter family protein, NADPH oxidoreduc- 
tase, ISCU, hydrolase, heat shock protein 70, cation 
efflux protein zinc transporter, glycerol-3-phosphate 
dehydrogenase, exportin, and pyruvate dehydrogenase 
El (PDH) were identified from the complete genome 
sequence of N. apis. Consistent with the finding in E. 
cuniculi, the genes encoding PDH El alpha and beta 
subunits were found in N. apis. However, PDH E2 and 
E3, which are components of the PDH complex, are ab- 
sent. This suggests that while PDH provides evidence 
of mitochondria-derived metabolic activity, the func- 
tion of PDH in microsporidia may be different from 
that in other eukaryotes, where it catalyzes the decarb- 
oxylation of pyruvate to acetyl-CoA and C0 2 . The 
identified mitosomal proteins clustered into different 
functional groups, including iron-sulfur (Fe-S) cluster 
assembly and export, transporter, chaperones and co- 
chaperones, tubulin and motor proteins, metabolic 
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Figure 4 The genome browser screenshot showing six-frame translation and % of GC content on sequenced scaffolds. 



processes and anti-oxidative stress. The proteins involved 
in Fe-S cluster assembly are essential to mitochondrial 
function under oxygen-limiting conditions and are the 
most prominent functional group within the predicted 
mitosomal proteins. The putative mitosomal proteins 
identified in N. apis are listed in Table 3. 

Virulence factors are molecules expressed and secreted 
by parasites that are keys for causing disease in the host as 
the parasites enter into and exit out of host cells, and 
inhibit certain host functions. Genome analysis of N. apis 
has identified genes encoding a wide array of potential 
virulence determinants. These potential virulence factors 
of N. apis include genes encoding 1) polar tube proteins 
defining structure and invasion mechanisms; 2) spore wall 
and anchoring proteins involved in host tissue recognition 
and the initiation of host cell invasion; 3) endochitinase 
and chitin synthase involved in spore-host cell attachment; 
4) proteins involved in pathogenicity regulations and 
energy parasitism, 5) components of the spliceosome 
machinery that plays a major role in the generation of 
proteomic diversity and proteins regulating splicing 
activity, and 6) putative secretive proteins that are 
involved in interactions with the host and are essential 



for modulation of host immunity towards successful 
completion of the parasite lifestyle in the host (Table 3). 

Functional conservation and divergence of protein 
sequences between N. Apis and N. Ceranae 

While putative proteins of N. apis share significant se- 
quence homology and functional similarity for enriched 
GO terms with N. ceranae, the comparison of the GO 
terms based on Fishers exact test after a false discovery 
rate correction (p<0.05) displayed some differences 
between the two species. Comparison of biological pro- 
cesses (Figure 8A) showed that genes involved in the de- 
velopmental process, organismal development, signaling, 
kinase activity, translational activity were on the same level 
of abundance for both species. However, genes that are 
functionally linked to cell amino acid derivative meta- 
bolic process and reproduction were significantly more 
common in N. apis than in N. ceranae. In contrast, genes 
involved in metabolic processes and stress response were 
greatly expanded in N. ceranae. The protein transport 
genes were also found to have greater representation in 
N. ceranae than in N. apis. N. apis and N. ceranae also 
displayed different protein profiles responsive to stress 
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Figure 5 Percentage of N. apis predicted genes as annotated 
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and proteins without blast hit. Annotated proteins share similarity 
with GO-annotated proteins. Unannotated proteins do not share 
sufficient similarity with GO- annotated proteins. Hypothetical 
proteins share similarity with previously published proteins of 
unknown function. Unclassified proteins have no sequence similarity 
with any sequence in public databases. The percentage for each 
group is given in parentheses. The annotated proteins were further 
classified into three GO categories: biological process, molecular 
function, and cellular component (Level 2). The y-axis indicates the 
percentage of a specific category of genes within the 
main categories. 



and stimuli. Genes responsive to stress were over repre- 
sented in N. ceranae compared to N. apis, while N. apis 
contained a slightly greater number of genes responding 
to abiotic stimuli. Both species had about the same 
number of genes related to responses to endogenous 
stimuli. While cell homeostasis and response to external 
stimulus genes were only identified in N. ceranae, sym- 
biosis and encompassing, mutualism through parasitism 
genes were unique to N. apis. The genes related to re- 
sponse to external stimuli were present only in N. ceranae 
and absent in N. apis. These gene families may therefore 
represent species-specific proteins. For molecular function 
GO terms, a number of differences were observed 
between the two Nosema species (Figure 8B). Structure 
molecule activity proteins that contribute to the structural 
integrity of a complex or assembly within or outside a cell 
were found significantly reduced in N. ceranae, compared 
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Figure 6 Putative regulatory C-rich motif in upstream regions 
of start codons of N. apis genes. Sequence logo of motif 
indentified by MEME and representative upstream sequence of start 
codon of N. apis genes showing the putative regulatory motif. 



to N. apis. The calcium binding protein that was not in 
N. apis was present in N. ceranae. An enrichment test 
of cellular components showed that genes encoding 
nucleolus, Golgi apparatus, cytosol, nuclear chromo- 
some, mitochondrion and plasma membrane were at 
approximately similar levels of abundance for both spe- 
cies. However, intracellular membrane-bounded organ- 
elles including mitochondrion and endoplasmic reticulum 
were overrepresented in N. ceranae relative to N. apis. In 
contrast, nucleoplasm was significantly more common 
in N. apis than in N. ceranae. The ribosome was also 
found to be more represented in N. ceranae in relation 
to N. apis. The vacuole and vesicle proteins were 
present in N. ceranae but absent in N. apis, suggesting 
species-specific genes involved in structure and func- 
tion of cellular components. 

Conclusions 

Nosema disease is regarded as one of the most destructive 
and widespread adult honey bee diseases in existence and 
has become a serious problem in the beekeeping industry 
worldwide because of an alarming increase in the diseases 
prevalence as well as its association with the widespread 
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collapse of honey bee colonies. The sequencing and anno- 
tation of the N apis genome provide a comprehensive 
overview of genetic content, structure and organization 
of the parasite and give some interesting insights into 
the complex biological and molecular processes of the 
parasite. The comparison of the N. apis genome to its 
sympatric congener, N ceranae, provides an opportun- 
ity to define a subset of genes and their associated path- 
ways that may be responsible for the pathogenesis of 
Nosema disease and identify potential virulence factors 
that could be used as potential drug targets and for the 
development of novel antimicrobial agents. The identifi- 
cation of an overrepresented (ACCCTT) motif in the 
upstream promoter regions of 37% of N. apis genes led 
us to wonder if these genes have similar patterns of pro- 
moter activity and are regulated by the same transcrip- 
tional factors. Post-genomic experimental investigations 
are warranted to identify possible transcriptional factors 
and to determine the roles of the motif in the gene 
expression profile of Nosema, parasitic processes of de- 
velopment and host invasion. The transcription factors 
may offer an attractive opportunity to exploit RNAi- 
based targets for therapeutic intervention of Nosema 
disease. 



Genomic data of N apis and N ceranae showed that 
overall genetic structure, contents, coding capacity and 
proteome complexity of N apis and N ceranae are 
similar, suggesting a shared evolutionary history for the 
two species. Previous studies have shown that variability 
in codon usage is correlated strongly with GC content, 
particularly at the third codon position (GC3), and also 
that highly expressed genes exhibit very high levels of 
codon bias. The evolution and functional significance of 
the genome variation in GC content and codon usage 
bias have been widely reported between organisms and 
also within a genome; however, the underlying causes of 
these variations remain poorly elucidated and there has 
been a long debate over whether these variations are se- 
lected or neutral traits [58]. It is hard to define whether 
selection plays a role in codon usage bias in N apis with 
observed weak correlation between codon usage bias 
and the GC content at the third site (GC3). The slight 
difference in correlation of genomic GC content and 
codon usage between N apis and N ceranae may reflect 
an evolutionary divergence or may simply be due to 
gene length effects of the two species. 

The large number of orthologous protein groups shared 
by N apis and N ceranae and similar biochemical 
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Figure 8 Distribution of enriched functions in A) Biological Process (BP), B) Molecular Functions (MF) and C) Cellular Component (CC) 

by N. apis and N. ceranae. For each panel, the pie displays the distribution of annotated sequences (%) by BP, MF, or CC in both N. opis and 
N. ceronoe datasets. The stars and diamond shaped squares mark more than 2% difference of enriched GO terms between two species. The stars 
indicate specific GO terms that are significantly more prevalent in N. ceronoe than in N. opis by functional enrichment analysis. The diamond 
shaped squares indicate specific GO terms that are significantly more prevalent in N. opis than in N. ceronoe by functional enrichment analysis. 
The cycles indicated that GO are unique to either one species of Nosemo. 



pathways encoded by N. apis and N. ceranae suggests that 
the core genes and their associated functions have been 
retained in the genus Nosema. The inferences of biological 
process, molecular function and cellular component 
categorizations based on predicted gene products, how- 
ever, distinguish the two species. Although N. apis and 
N. ceranae lack the sophisticated biosynthetic pathways 
that are necessary for energy production, biochemical 
analyses showed that ATP transporter proteins are 
enriched in both parasites, largely reflecting their host- 
dependent intracellular lifestyle. The key enzymes in- 
volved in energy transporter activity and metabolic 
process are more represented in N. ceranae than N. apis, 
suggesting that N. ceranae may have a greater capacity for 
biosynthesis and better ability to obtain energy from its 
host, compared with N. apis. Although N. ceranae is a 
comparatively recent introduction into populations of 
European honey bees, N. ceranae is the more prevalent of 
the two Nosema species in honey bees and appears to be 
replacing N. apis in some populations of A. mellifera 
[37,43,59,60]. The high survival capacity of N. ceranae 
would allow the parasite to consume the host for its own 
growth and multiplication and therefore reduce the hosts 
fitness over time but generally not kill it. In contrast to the 
distinctive disease symptoms caused by N. apis, such as 
dysentery and crawling behavior, N. ceranae infection is 
generally not associated with external symptoms. Instead, 
infected colonies have been found to decline gradually and 
collapse over a period of time. The differences in host 
pathogenicity between N. apis and N. ceranae may be, in 
part, a result of their differences in molecular function and 



biological process. The novel sequences of N. apis without 
detectable homologs in N. ceranae and vice versa suggests 
that there has been gene shuffling along the evolutionary 
pathway separating N. apis and N. ceranae from their 
common ancestor. 

The success of the parasite also relates to its ability to 
respond to stress and endogenous stimuli within hosts. 
The result that proteins involved in response to stress and 
endogenous stimulus were more represented in N. ceranae 
than N. apis suggests that N. ceranae may have a better 
ability to survive under stressful conditions such as host 
immune functions. Alternatively, N. apis may be able to 
sense and respond rapidly to abiotic stimuli such as envir- 
onmental changes, as categories related to the response to 
abiotic stimuli were more represented in N. apis than in 
N. ceranae. This assumption is supported by recent find- 
ings that N. ceranae and N. apis responded differently to 
decreasing fumagillin levels [31] and that N. apis was 
more resistant to cold temperatures than N. ceranae [61]. 

Discovering virulence factors is a key to understanding 
microsporidia pathogenesis and their interactions with 
their host and for identifying targets of novel drugs. 
The data from the genome sequences of N. apis as well 
as N. ceranae have allowed us to identify a number of 
putative virulence factors. For example, host tissue recog- 
nition and polar tube extrusion for initiating cell invasion 
are important factors in the pathogenesis of microsporidia 
parasites. Parasite spore wall and polar tube proteins that 
are essential for host cell specificity and cell invasion pro- 
cesses therefore could be potential targets for develop- 
ment of antigen-based diagnostic tests, antimicrobials and 
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Table 3 N. apis sequences that show homology to annotated and experimentally verified pathogenicity and virulence 
genes of microsporidia species 

Product/Gene Sequence Function Homologous to N. Organism whose annotated sequence has Reference 

ID ceranae top BlastP hit 



1) Polar Tube Protein 

PTP2 
PTP3 
PTP3 

2) Spore Wall Protein 

SWP12 
SWP25 



NAPIS-00435 

NAPIS-01921 Host cell invasion 
NAPIS-01922 

NAPIS-00521 
NAPIS-00461 

NAPIS-00514 Structural capacity 



Yes 
Yes 
Yes 

Yes 
Yes 
Yes 



N. bombycis 
N. bombycis 
N. bombycis 

N. bombycis 
N. bombycis 
N. bombycis 



[77] 
[77] 
[77] 

[78] 
[79] 
[79] 



SWP26 
SWP32 
3) Endochitinase 



4) Chitin synthase 



6) Mitogen-activated 
protein kinase 



7) ATP/AD P Translocase 



NAPIS-01092 
NAPIS-00203 
NAPIS-00269 

NAPIS-02055 
NAPIS-02138 

NAPIS-01775 
NAPIS-02139 



NAPIS-01928 



host cell adhesion 



Spore attachment 



Spore attachment 
& pathogenicity 



Pathogenicity 
Regulation 



8) Transferase 



Yes 
Yes 
Yes 

Yes 
No 

Yes 
Yes 



Yes 



NAPIS-01593 

NAPIS-01 593 Energy parasitism 
NAPIS-02128 
NAPIS-02287 

NAPIS-00457 

NAPIS-00459 Pathogenicity 
NAPIS-00459 regulation 
NAPIS-00578 
NAPIS-00596 

9) Splicing machinery 

NAPIS-00564 

NAPIS-00654 Pathogenicity 
NAPIS-01 01 2 regulation 
NAPIS-02190 

9) Putative secretive proteins (399 proteins listed in Additional file 4: 



Yes 
Yes 
Yes 
Yes 

No 
Yes 
Yes 
No 
Yes 

Yes 
Yes 
Yes 
Yes 
Table SI) 



N. bombycis 
N. bombycis 
N. bombycis 

E. hellem 
E. hellem 

E. intestinal is 
E. hellem 



E. intestinal is 



E. cuniculi 

E. cuniculi 

E. cuniculi 

E. cuniculi 

E. hellem 
E. cuniculi 
E. intestinal is 
E. hellem 
E. hellem 

E. intestinal is 
E. intestinal is 

E. hellem 
E. romaleae 



[79] 
[80] 
[78] 

[56] 
[56] 

[53] 
[56] 



[53] 



[50] 
[50] 
[50] 
[50] 

[56] 
[50] 
[53] 
[56] 
[56] 

[53] 
[53] 
[56] 
[56] 



siRNA molecules. The proteins secreted from cells into 
the extracellular space mediate important parasite-host 
interactions and therefore are also interesting targets for 
innovative therapeutics. Future experimental evaluation of 
these virulence factors of microsporidia through molecular 



manipulations such as RNA interference (RNAi) [62] 
should provide valuable insight into their usefulness in 
drug development. 

Although they share some morphological, biological 
and genetic features, as well as invasive strategies, the 



Chen et al. BMC Genomics 2013, 14:451 
http://www.biomedcentral.eom/1 471 -21 64/1 4/451 



Page 12 of 16 



genome sequences of N apis and N. ceranae revealed 
that these microsporidia species are highly diversified 
and that not all microsporidia genomes are characterized 
by reductions in size and complexity. Nosema genomes 
appear to be much less compact than the smallest 
microsporidia genomes, such as E. cuniculi, providing 
new insight into the adaptive evolution caused by the 
host-parasite relationship across the microsporidan 
groups. The availability of draft sequences of the N apis 
genome is a significant addition to the rapidly expanding 
body of microsprodian genomic data which has been 
improving our understanding of eukaryotic genome 
diversity and evolution in a broad sense. 

Methods 

Bee sample collection and spore purification 

Adult worker bees infected with N apis were collected 
from an apiary infected exclusively with N. apis in 
Beaverlodge, Alberta, Canada in 2007. The species status 
of the Nosema infection was determined by a combination 
of morphological and molecular methods. In order to pur- 
ify spores of N. apis from the infected bees, alimentary 
tracts were individually removed by grasping the stinger 
with forceps and pulling it out with gentle traction. The 
infected alimentary tracts were homogenized in 3 ml 
sterile water using a conical ground-glass homogenizer 
with a rough grinding surface and filtered through a 
Corning (Lowell, MA) Netwell insert (24 mm diameter, 
74 mm mesh size) to remove large-sized tissue debris. The 
filtered suspension was centrifuged (Rotor: Sorvall RC-5B) 
at 3,000 g for 5 minutes to pelletize the spores. The 
resuspended pellet was then purified on a discontinuous 
Percoll (Sigma-Aldrich, St. Louis, MO) gradient consisting 
of 10 ml each of 30%, 60%, and 90% Percoll solution by 
overlaying the spore suspension onto the gradient and 
centrifuging at 8,000 g for 10 minutes. The pelleted spores 
were further purified with 30 ml of 100% Percoll solution 
by centrifuging at 5,000 g for 2-3 minutes to separate the 
spores and lighter cellular debris into the top and bottom 
layers, respectively. The spore- containing supernatant was 
collected and diluted with water in the ratio of 1:1 and 
centrifuged at 8,000 g for 10 minutes to pellet the spores. 
The purity and size of spores were examined under a light 
microscope (Nikon, Eclipse TE 300) and photographed 
with a Nikon Digital Camera (DXM 1200). 

Genomic DNA isolation and N. Apis species confirmation 

The genomic DNA was isolated using CTAB (hexadecyl- 
trimethylammonium bromide) extraction buffer follow- 
ing spore disruption. Approximately 10 6 N apis spores 
were suspended in 500 ml CTAB buffer (100 mM Tris- 
HC1, pH 8.0; 20 mM EDTA, pH 8.0; 1.4 M sodium 
chloride; 2% cetyltrimethylammonium bromide, w/v; 
0.2% 2-mercaptoethanol) and broken by adding 500 mg 



of glass beads (425-600 mm, Sigma-Aldrich, St. Louis, 
MO) into the tube and disrupting the mixture at max- 
imum speed for 3-5 minutes using a FastPrep Cell Dis- 
rupter (Qbiogene, Carlsbad, CA). The mixture was then 
incubated with proteinase K (200 mg/ml) at 55°C for 
five hours and treated with chitinase (0.4 U) with an ex- 
tended incubation period of 16 hours at 55°C. The gen- 
omic DNA was twice extracted in an equal volume of 
phenol/chloroform/isoamyl alcohol (25:24:1), followed 
by a single extraction in chloroform. The purified DNA 
was precipitated with isopropanol, washed in 70% ethanol, 
and dissolved in 100 ul sterile water. The concentration 
and purity of the DNA were determined by spectrophoto- 
metric absorption at 260 nm, and ratios of absorption at 
260 nm and 280 nm. 

The Nosema species specificity of extracted genomic 
DNA was confirmed by a PCR assay prior to 454 pyro- 
sequencing. Primers specific to N apis and to N ceranae 
were used individually in the PCR-mediated amplification 
of genomic DNA isolated from purified spores. The se- 
quences of N apis primers that amplified 269-bp fragment 
of the 18S rRNA were as follows: apisF 5'- CCATTGCC 
GGATAAGAGAGT -3' and apisR 5 - CCACCAAAAAC 
TCCCAAGAG -3'. The PCR reaction mixture, thermal 
conditions, and sequences of N ceranae primers used 
were as previously described [41]. The amplified PCR 
products were purified using the Wizard PCR Prep DNA 
Purification System (Promega, Madison, WI) and the 
nucleotide sequence identity of the PCR fragments was 
determined using a Basic Local Alignment Sequence Tool 
(BLAST) homology search. 

Library construction, sequencing, and assembly 

Although the high% AT vs.% GC ratio of this genome 
poses challenges for all high-throughput sequencing plat- 
forms, a successful sequencing library was constructed, se- 
quenced and assembled through protocol optimizations. 
The 454 Titanium sequencing platform offers an ideal 
combination of throughput and long read length to enable 
de novo sequence assembly of whole genomes. A paired- 
end library was constructed according to the manufac- 
turers protocol, with modifications and optimizations to 
enable robotic automation of crucial steps and manipula- 
tion of limited, high-% AT starting material. An initial 
sample of 500 ng DNA was sheared using a Covaris E210 
high-performance ultra-sonicator. Following adaptor 
ligation and double AMPure SPRI bead-based size selec- 
tion, the resulting library had an average fragment size of 
2,072 bp with a standard deviation of 208 bp. This library 
was sequenced on a full two-region 454 Titanium run 
resulting in 1,070,575 passed-filter sequence reads with an 
average quality-trimmed length of 304 bp. During data 
quality control procedures, 126,461 reads, or 11.8% of the 
total, were eliminated as PCR-induced duplicates. This 
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represents a typical duplicate read percentage for the 454 
sequencing platform. An additional 10% of the reads were 
eliminated as too short for assembly or containing an 
inconsistent or partial paired-end linker sequence. Of the 
remaining reads, 360,731 were bisected at the linker to 
form successful pairs, while 475,944 remained as unpaired 
shotgun reads. 

These data were assembled using the Celera Assembler 
(http://sourceforge.net/projects/wgs-assembler/) version 6.1 
using the CABOG module, an overlap-layout-consensus 
sequence assembler that has been recently optimized 
for use with 454 Titanium paired-end sequence data. 
The Celera Assembler uses a stringent algorithm that 
heavily weighs paired-end read constraints and favors 
shorter, higher-confidence contigs over longer, poten- 
tially misassembled contigs. This approach ensures a 
highly accurate result with minimal mis -assembly. 

Repetitive element identification 

Genome analysis of N. ceranae showed the presence of a 
large proportion of the repetitive elements which could 
interfere with the accuracy of de novo gene prediction. We 
therefore used RepeatMasker (http://www.repeatmasker. 
org), Tandem Repeat Finder [63], and LTR Finder [64] in 
an ab initio manner with the default settings to identify 
the repetitive elements. Repetitive elements and low 
complexity DNA sequences in the N. apis genome were 
then masked for elimination. To avoid the incorrect 
identification of repetitive elements and low complexity 
from DNA sequences, manual inspection of the list of 
repeats was carried out. The repeats that had significant 
Basic Local Alignment Search Tool (BLAST) hits to the 
NCBI protein database (E-value cutoff of 1.0E-5) were 
kept for subsequent gene prediction and annotation. 

Protein-coding gene prediction and regulatory motif 
identification: 

After the exclusion of a few contaminating honey bee 
host sequences identified through BLASTN searches 
against the bee genome, a combination of two programs 
was used for identification of protein-coding sequences 
(CDSs) in the masked N. apis genome: GeneMark.hmm. 
ES (version 3.0) using non-supervised training procedures 
and Augustus trained with E. cuniculi, and S. cerevisiae. 
The GeneMark.hmm [65] and AUGUSTUS [66] programs 
for eukaryotic genomes access the coding potential of 
DNA sequence based on Hidden Markov Models 
(HMM). Next, outputs of supervised gene prediction 
based on sequence similarity to reference organisms 
(extrinsic method) and ab initio gene prediction relying 
on intrinsic features of the DNA sequence (intrinsic 
method) were merged and reconciled with other gene 
models. The CD-HIT-2D [67] was used for identifying 
overlapping and unique sets of proteins from different 



gene prediction programs. For overlapping predictions 
produced by different programs, the prediction with 
BLAST evidence was kept. If the overlapping predictions 
by different programs did not have BLAST evidence or if 
they all had BLAST evidence, the longest prediction was 
chosen. To verify the correct prediction of protein-coding 
genes, all inferred genes larger than 50 aa were further 
compared across an NCBI non-redundant protein data- 
base using BLASTP program with default parameters. 
Frame-shifts and point mutations were detected and 
corrected where appropriate. Candidates with a protein 
identity match > 30%, and BLASTP value < 10-5 were 
retained as protein-encoding gene candidates. 

Nucleotide composition and synonymous codon usage 
of protein coding genes were investigated using the pro- 
gram INCA2.0 by [68]. The maximum-likelihood codon 
bias (MCB) was used as a measure for codon usage bias 
where the contribution to the index of the bias of each 
amino acid is weighted by an estimation of the likeli- 
hood of occurrence of bias on each amino acid [69] . The 
relative synonymous codon usage (RSCU) of individual 
codons between N. apis and N. ceranae were compared. 
Putative transcriptive regulators were searched by identi- 
fying over-represented motifs in the 100-bp upstream 
region of the start codon of each protein-coding gene. 
The MEME 3.0 software program [70] was used for the 
motif identification. 

Signal peptide and transmembrane region prediction 

The N-terminal amino acid sequence of predicted protein 
from this study was analyzed for the presence of putative 
signal peptides using the SignalP 3.0 program [71], and a 
neural network and HMM trained on eukaryotes. The 
transmembrane regions of predicted protein were identi- 
fied with TMHMM 2.0, a HMM based transmembrane 
helix prediction program. The architecture of the hidden 
Markov model and the neural network algorithm was 
applied for prediction of the signal peptide and trans- 
membrane region of each translated protein. The de- 
fault parameters were used for both programs. 

Non-coding RNAs 

tRNA genes were detected using two tRNA identification 
programs, tRNAScan-SE version 123 (Lowe and Eddy, 
1997) and ARAGORN [72] at the default settings. The 
tRNAScan-SE search was based on tRNA training data 
from a mixed general tRNA model. The ARAGORN 
search was based on the homology of recognized tRNA 
consensus sequences by specifying a yeast nuclear genome 
(NC_001133-NC_001148) and the ability of input data to 
form a base-paired cloverleaf to predict tRNA secondary 
structure. Two programs are also able to distinguish tRNA 
pseudogenes from true genes with high sensitivity. 
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Ribosomal genes coding rRNA in a full genome se- 
quence were identified using both RNAmmer 1.2 server 
[73] with the default parameters and BlastN searches. 
Matches were defined as those sequences having greater 
than 90% similarity and greater than 90% of the query 
length. Small nuclear RNAs (snRNA) were also identi- 
fied by BlastN searches. rRNA sequences were annotated 
based on homology to previously published rRNA se- 
quences deposited in GeneBank. 

Ortholog and paralog identification 

The possible orthologs of N. apis in N. ceranae, E. cuniculi 
or S. cerevisiae genomes were predicted by searching for 
the best reciprocal BLASTP matches with InParanoid 
[74]. In each of the between-species comparisons, we 
aligned protein sequences from one species to protein 
sequences from the other, and corresponding best matches 
were then aligned back to the original species. On the 
basis of all of the possible between-species sequence pairs, 
we reported orthologs and in-paralogs that had multiple 
hits in the same genome with a bit score cutoff of 40 and 
an aligned region larger than 50% of the longer query se- 
quence in each pair. We also performed bootstrap analysis 
to obtain confidence for each ortholog identified. Gene 
duplications were identified with a focus on proteins in 
the data set that had a better match within the N. apis 
genome they reside (excluding the self match) when com- 
pared against all other genomes used in the study. In order 
to avoid assigning paralogs to ortholog pairs, the top hit 
had to have an E-value 10 5 times lower (more significant) 
than the next best hit. In addition, the complete genome 
sequence of N. apis was aligned individually to the 
complete genome of N. ceranae to find the corresponding 
orthologous blocks in the genomes and to illustrate homo- 
logs between genomes by using the Mauve 2.2. [75]. 

Protein domain functional annotation 

The predicted proteins were analyzed using the 
BLASTP algorithm to search for homolog sequences 
and functionally annotated by gene ontology (GO) 
terms, InterPro terms (Conserved patterns in se- 
quences, InterProScan, EBI), enzyme classification (EC) 
codes, and metabolic pathways (Kyoto Encyclopedia of 
Genes and Genomes, KEGG) using the Blast2GO (B2G) 
software suite v2.4.2 (www.BLAST2go.org, [76]. The 
functional annotation in B2G was performed using the 
following steps: 1) performing BlastP to final homolo- 
gous sequences to the query set; 2) mapping to collect 
gene ontology terms (GO) associated to the hits after a 
BLASTP search; 3) selecting GO terms from the GO 
pool collected in the mapping step and assigning them 
to the query sequence; 4) grouping the descriptions into 
'GOslim" categories for biological process, molecular 
function, and cellular component, and 5) performing 



Enzyme code and InterPro annotations support lan- 
guage by GO terms. Finally, an enrichment analysis was 
performed to assess the differences in functional classes 
between N. apis and N. ceranae using one-way Fisher's 
Exact Tests. The default settings of B2G were used in 
every annotation step. The false discovery rate (FDR) at 
level a = 0.05 was applied for controlling procedures. 
Gene product names are assigned based on sequence 
homology to known genes in NCBI databases. The 
putative proteins without GO annotation were classified 
as "hypothetical protein with no significant database hit". 

Ethics statement 

The apiaries for bee sample collection are the property 
of the USDA-ARS Beltsville Bee Research Laboratory, 
MD, USA and the Agriculture & Agri-Food Canada, 
Beaverlodge Research Farm, Beaverlodge, AB, Canada; 
therefore, no specific permits were required for the 
described studies. Studies involved the European honey 
bee (Apis mellifera), which is neither an endangered nor 
protected species. 

Additional files 



Additional file 1: Figure SI. Exon-intron junctions in most of the 
intron-containing protein-coding genes of N. apis. The consensus 
sequences "A(G^0GT" at the donor site and "A(G/A)" at the acceptor site 
were indicated by arrows separately. 

Additional file 2: FigureS2. Characteristics of homology searches of N. 
apis protein-coding genes against the non-redundant protein sequences 
(nr) at NCBI using Blastp. (a) E-value distribution of the top BLAST hit for 
each unique sequence with a cut-off E-value of 1.0e-3. The most 
sequences have an e-value between 1.0e-3 and 1.0e-25; (b) Similarity 
distribution of the top BLAST hit for each unique sequence. The 
sequence similarity of N. apis with database by Blast search ranges from 
35% to approx 98% and peaks at 52%; and (c) Top-species distribution of 
the top BLAST hit for each unique sequence. The sequences of N. apis 
sequences showed the most significant similarity to the sequences of N. 
ceranae followed by sequences of £ intestinalis and E. cuniculi. 

Additional file 3: Figure S3. Mapping of N. apis protein-coding genes 
to GO terms associated to Blastp hits, (a) Evidence code distribution for 
BLAST hits. The evidence code distribution for BLAST hits chart shows an 
overrepresentation of Inferred Electronic Annotation (IEA), followed by 
Inferred by Direct Assay (IDA); (b) Evidence code distribution for 
individual sequences. The highest evidence code for the individual 
sequences was through Inferred Electronic Annotation (IEA), second by 
Inferred by Direct Assay (IDA) and third by Inferred by Mutant Phenotype 
(IMP); (c) Mapping database sources. The majority of N. apis genes are 
obtained from the UniProt Knowledge Base (KB), a nonredundant protein 
database that includes PDS, UniProt, Swiss-Prot, TreMBL, and TAIR. 

Additional file 4: Table SI. Predicted coding genes and annotations. 

Additional file 5: Table S2. Complete list of the ortholog pairs in the 
genome of N. apis and N. ceranae. 

Additional file 6: Figure S4. Functional assignment terms to query 
sequences from the pool of GO terms gathered in the mapping step, (a) 
Annotation results. Of CDSs of N. aps with blast hit, 545 were assigned as 
conserved hypothetical proteins. The subsequent mapping assigned GO 
terms to 978 positive BLAST hit sequences, (b) Annotation distribution; 
Most sequences have between 1 and 6 GO terms annotated; (c) GO-level 
distribution. N. apis sequence GO terms representation for biological 
process (BP), molecular function (MF) and cellular component (CC) 
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ontologies. The mean GO-level is 5.749 and 5155 annotations could be 
assigned; (d) Number of GO-terms for N. apis sequences with length (x). 
The length of most GO term annotated N. apis sequences were are in 
the 80-700 bp range; (e) Annotation score distribution; and (f) 
Percentage of N. apis sequences with length (x) annotated. 
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